Abstract

Pattern learning is an important problem in Natural Language Processing (NLP). Some exhaustive
pattern learning (EPL) methods (Bod, 1992) were proved to be flawed (Johnson, 2002), while
similar algorithms (Och and Ney, 2004) showed great advantages on other tasks, such as machine
translation. In this article, we first formalize EPL, and then show that the probability given by an
EPL model is a constant-factor approximation of the probability given by an ensemble method that
integrates an exponential number of models obtained with various segmentations of the training data.
This work provides, for the first time, theoretical justification for the widely used EPL algorithm in
NLP, which was previously viewed as a flawed heuristic method. Better understanding of EPL may
lead to improved pattern learning algorithms in the future.

1. Introduction

Pattern learning is the crux of many natural language processing (NLP) problems. It is usually
solved as grammar induction for these problems. For parsing, we learn a statistical grammar with
respect to a certain linguistic formalism, such as Context Free Grammar (CFG), Dependency
Grammar (DG), Tree Substitution Grammar (TSG), Tree Adjoining Grammar (TAG), or Combinatory
Categorial Grammar (CCG). For machine translation (MT), we learn a bilingual grammar that
transfers a string or tree structure in a source language into a corresponding string or tree structure in
a target language.

What is embarrassing is that many of the grammar induction algorithms that provide
state-of-the-art performance are usually regarded as less principled in terms of statistical modeling.
Johnson (2002) and Prescher et al. (2004) showed that the data oriented parsing (DOP) algorithm of
Bod (1992) is biased and inconsistent. In the MT field, almost all the statistical MT models
proposed in recent years rely on similar heuristic methods to extract translation grammars, such as
Koehn et al. (2003); Och and Ney (2004); Chiang (2005); Quirk et al. (2005); Galley et al. (2006);
Shen et al. (2008); Carreras and Collins (2009), to name a few. Similar heuristic methods
have also been used, in an implicit way, in many other pattern learning tasks, for example semantic
parsing as in Zettlemoyer and Collins (2005) and chunking as in Daumé III and Marcu (2005).

In all these heuristic algorithms, one needs to extract overlapping structures from training data
in an exhaustive way. Therefore, in this article, we call them exhaustive pattern learning (EPL)
methods. The use of EPL methods is intended to cope with the uncertainty of building blocks used
in statistical models. As far as MT is concerned, Koehn et al. (2003) found that it was better to define
a translation model on phrases than on words, but there was no obvious way to define what phrases
were. DeNero et al. (2006) observed that exhaustive pattern learning outperforms generative models
with fixed building blocks.

In EPL algorithms, one needs to collect statistics of overlapping structures from training data,
so the resulting models are not valid generative models. Thus, the EPL algorithms for grammar
induction were viewed as heuristic methods (DeNero et al., 2006; Daumé III, 2008). Recently,
DeNero et al. (2008); Blunsom et al. (2009); Cohn and Blunsom (2009); Cohn et al. (2009); Post and
Gildea (2009) investigated various sampling methods for grammar induction, which were believed to
be more principled than EPL. However, there was no convincing empirical evidence showing that
these new methods provided better performance on large-scale data sets.

In this article, we will show that there exists a mathematically sound explanation for the EPL
approach. We will first introduce a likelihood function based on ensemble learning, which
marginalizes over all possible building block segmentations on the training data. Then, we will show
that the probability given by an EPL grammar is a constant-factor approximation of an ensemble
method that integrates an exponential number of models. Therefore, with an EPL grammar induction
algorithm, we learn a model with much more diversity from the training data. This may explain why
EPL methods provide state-of-the-art performance in many NLP pattern learning problems.

The rest of the article is organized as follows. We will first formalize EPL in Section 2. In
Section 3, we introduce the ensemble method, and then show the approximation theorem and its
corollaries. We discuss a few important problems in Section 4 and conclude our work in Section 5.



2. Formalizing Exhaustive Pattern Learning 

For the purpose of formalizing the core idea of EPL, we hereby introduce a task called monotonic
translation. Analysis on this task can be extended to other pattern learning problems. Then, we will
define segmentation on training data, and introduce the EPL grammar, which will later be used in
the theoretical justification of EPL in Section 3.



2.1 Monotonic Translation 

Monotonic translation is defined as follows. The input $x \in \mathcal{X}$ is a string of words
$x_1 x_2 \ldots x_i$ in the source language. The monotonic translation of $x$ is $y \in \mathcal{Y}$,
a string of words, $y_1 y_2 \ldots y_i$, of the same length in the target language, where $y_j$ is
the translation of $x_j$, $1 \le j \le i$.

In short, monotonic translation is a simplified version of machine translation. There is no word 
reordering, insertion or deletion. In this way, we ignore the impact of word level alignment, so as 
to focus our effort on the study of building blocks. We leave the incorporation of alignments for 
future work. In fact, we can simply view alignments as constraints on building blocks. Monotonic 
translation is already general enough to model many NLP tasks such as labelling and chunking. 



2.2 Training Data Segmentation and MLE Grammars 

Without loss of generality, we assume that the training data $D$ contains a single pair of word
strings, $x_D$ and $y_D$, which could be very long. Let $x_D = x_1 x_2 \ldots x_n$, and
$y_D = y_1 y_2 \ldots y_n$. Source word $x_i$ is aligned to target word $y_i$. Let the length of the
word strings be $|D| = n$. Figure 1 shows a simple example of training data. Here $|D| = 4$.

We assume that there exists a hidden segmentation on the training data, which segments $x_D$ and
$y_D$ into tokens. A token consists of a string of words, either on source or target, and it contains
at least one word. As for monotonic translation, the source side and the target side share the same
topology of segmentation. Tokens are the building blocks of the statistical model to be presented,
which means that the parameters for the model are defined on tokens instead of words.

Figure 1: An example of training data for monotonic translation.

Figure 2: An example of segmentation on training data.

A segmentation $s_D$ of $D$, or $s$ for short, is represented as a vector of $n - 1$ Boolean values,
$s_1 s_2 \ldots s_{n-1}$. $s_i = 0$ if and only if $x_i$ and $x_{i+1}$ belong to the same token. $s$
applies to both the source and the target in the same way, which means $x_i$ and $x_{i+1}$ belong to
the same token if and only if $y_i$ and $y_{i+1}$ belong to the same token.

If we segment $D$ with $s$, we obtain a tokenized training set $D_s$. $D_s$ contains a pair of token
strings $(u_s, v_s)$. $u_s = u_1 u_2 \ldots u_{|D_s|}$, and $v_s = v_1 v_2 \ldots v_{|D_s|}$, where
$|D_s|$ is the total number of tokens in $u_s$ or $v_s$. Figure 2 shows an example of segmentation on
training data. Here, $s_2 = 0$, so that we have a token pair that spans two words,
$(u_2, v_2)$ = (LEFT FOR, went to).

Given training data $D$ and a segmentation $s$ on $D$, there is a unique joint probabilistic model
obtained by the MLE on $D_s$. Each parameter of this model contains a source token and a target token.
Since each token represents a string of words, we call this model a string-to-string grammar $G_{D_s}$.
Specifically, for any pair of tokens $(u, v)$, we have

$$\Pr(u, v \mid G_{D_s}) = \frac{\#_s(u, v)}{|D_s|}, \tag{1}$$

where $\#_s(u, v)$ is the number of times that this token pair appears in the segmented data $D_s$.
As for the example segmentation $s$ in Figure 2, its MLE grammar is simply as follows.

$\Pr(\text{SOPHIE}, \text{Sophia} \mid G_{D_s}) = 1/3$
$\Pr(\text{LEFT FOR}, \text{went to} \mid G_{D_s}) = 1/3$
$\Pr(\text{PHILLY}, \text{Philadelphia} \mid G_{D_s}) = 1/3$
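
As a minimal sketch of Equation (1), the Python snippet below tokenizes a training pair under a given segmentation vector and computes the relative-frequency (MLE) grammar. The function name `mle_grammar` and the list encoding of $s$ are illustrative assumptions, not part of the original formulation; the word strings are the four-word example behind the grammar listed above.

```python
from collections import Counter
from fractions import Fraction


def mle_grammar(source, target, s):
    """Return Pr(u, v | G_{D_s}) for every token pair in the segmented data.

    s[i] == 0 keeps words i and i+1 (0-based) in the same token;
    s[i] == 1 separates them. This encoding is an assumption of the sketch.
    """
    assert len(source) == len(target) == len(s) + 1
    pairs, start = [], 0
    for i, separated in enumerate(s):
        if separated:
            pairs.append((tuple(source[start:i + 1]), tuple(target[start:i + 1])))
            start = i + 1
    pairs.append((tuple(source[start:]), tuple(target[start:])))
    counts = Counter(pairs)
    # |D_s| is the number of tokens produced by the segmentation.
    return {pair: Fraction(c, len(pairs)) for pair, c in counts.items()}


source = "SOPHIE LEFT FOR PHILLY".split()
target = "Sophia went to Philadelphia".split()
grammar = mle_grammar(source, target, [1, 0, 1])  # s_2 = 0 joins LEFT FOR
for (u, v), p in grammar.items():
    print(" ".join(u), "->", " ".join(v), p)
```

Exact fractions are used so that the probabilities can be compared directly with the hand-computed values.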

However, for any given training data, its segmentation is unknown to us. One way to cope with
this problem is to consider all possible segmentations. String distribution on the training data will
lead us to a good estimation of the hidden segmentation and tokens. In Section 3, we will
introduce an ensemble method to incorporate MLE grammars obtained from all possible segmentations.
Segmentations are generated with a certain prior distribution.

2.3 Exhaustive Pattern Learning for Monotonic Translation



Now we present an EPL solution. We follow the widely-used heuristic method to generate a
grammar by applying various segmentations at the same time. We build a heuristic grammar $G_{D,d}$
out of the training data $D$ by counting all possible token pairs $(u, v)$ with at most $d$ words on
each side, where $d \ll |D|$ is a given parameter:

$$\Pr(u, v \mid G_{D,d}) = \frac{\#(u, v)}{\sum_{(u',v')} \#(u', v')},$$

where $\#(u, v)$ is the number of times that the string pair encoded in $(u, v)$ appears in $D$, and

$$\sum_{(u',v')} \#(u', v') = \sum_{i=1}^{d} (|D| - i + 1) = \left(1 - \frac{d-1}{2|D|}\right) d\,|D|.$$

Therefore,

$$\Pr(u, v \mid G_{D,d}) = \frac{\#(u, v)}{\left(1 - \frac{d-1}{2|D|}\right) d\,|D|}. \tag{2}$$



For example, the heuristic grammar for the training data in Figure 1 is as follows if we set $d = 2$.

$\Pr(\text{SOPHIE}, \text{Sophia} \mid G_{D,2}) = 1/7$
$\Pr(\text{LEFT}, \text{went} \mid G_{D,2}) = 1/7$
$\Pr(\text{FOR}, \text{to} \mid G_{D,2}) = 1/7$
$\Pr(\text{PHILLY}, \text{Philadelphia} \mid G_{D,2}) = 1/7$
$\Pr(\text{SOPHIE LEFT}, \text{Sophia went} \mid G_{D,2}) = 1/7$
$\Pr(\text{LEFT FOR}, \text{went to} \mid G_{D,2}) = 1/7$
$\Pr(\text{FOR PHILLY}, \text{to Philadelphia} \mid G_{D,2}) = 1/7$



A desirable translation rule, (LEFT FOR, went to), is in this heuristic grammar, although its weight
is diluted by noise. The hope is that good translation rules will appear more often in the training
data, so that they can be distinguished from noisy rules.
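
The counting procedure behind Equation (2) can be sketched in a few lines of Python. The helper name `epl_grammar` is an assumption made for illustration; the sketch counts every aligned substring pair of up to $d$ words and checks the total count against the closed-form normalizer $(1 - \frac{d-1}{2|D|})\,d\,|D|$.

```python
from collections import Counter
from fractions import Fraction


def epl_grammar(source, target, d):
    """Count every aligned substring pair of 1..d words and normalise."""
    counts = Counter()
    n = len(source)
    for length in range(1, d + 1):
        for start in range(n - length + 1):
            u = tuple(source[start:start + length])
            v = tuple(target[start:start + length])
            counts[(u, v)] += 1
    total = sum(counts.values())
    return {pair: Fraction(c, total) for pair, c in counts.items()}, total


source = "SOPHIE LEFT FOR PHILLY".split()
target = "Sophia went to Philadelphia".split()
d = 2
grammar, total = epl_grammar(source, target, d)

# Closed-form normaliser from Equation (2): (1 - (d - 1) / (2|D|)) * d * |D|.
n = len(source)
closed_form = (1 - Fraction(d - 1, 2 * n)) * d * n
print(total, closed_form)
```

For the four-word example with $d = 2$, both the enumerated total and the closed form equal 7, which is where the weight 1/7 of each rule comes from.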

In the decoding phase, we use grammar $G_{D,d}$ as if it were a regular MLE grammar. Let
$x = x_1 x_2 \ldots x_i$ be an input source string. For any segmentation $a$ on the test sentence $x$,
let $u_a = u_1 u_2 \ldots u_k$ be the resultant string of source tokens. The length of the string is
$|x| = i$, and the length of the token string is $|u_a| = k$. The translation that we are looking for
is given by the target token vector $\hat{v}$, such that

$$(\hat{v}, \hat{a}) = \arg\max_{(v,a)} \Pr(u_a, v \mid G_{D,d}), \text{ where}$$

$$\Pr(u_a, v \mid G_{D,d}) = \prod_{j=1}^{|u_a|} \Pr(u_j, v_j \mid G_{D,d}) = \prod_{j=1}^{|u_a|} \frac{m_j}{\left(1 - \frac{d-1}{2|D|}\right) d\,|D|}, \tag{3}$$

where $m_j = \#(u_j, v_j)$. As in previous work on structure-based MT, we do not calculate the
marginal probability that sums over all possible target token strings generating the same word string,
due to the concern of computational complexity.

1. For the sake of convenience, in the rest of this article, we no longer distinguish a token and the
string contained in this token unless necessary. We use symbols $u$ and $v$ to represent both. The
meaning is clear in context.

Obviously, with $G_{D,d}$, we can take advantage of larger context of up to $d$ words. However,
a common criticism against the EPL approach is that a grammar like $G_{D,d}$ is not mathematically
sound. The probabilities are simply heuristics, and there is no clear statistical explanation. In the
next section, we will show that $G_{D,d}$ is mathematically sound.
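
Returning to the decoding rule in Equation (3), the search over pairs $(v, a)$ can be organized as a simple dynamic program over prefixes of the input. The sketch below is one possible realization under stated assumptions: the function names `epl_grammar` and `decode` are hypothetical, ties are broken in favor of the first candidate found, and no pruning is applied.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def epl_grammar(source, target, d):
    """Heuristic grammar G_{D,d}: relative counts of aligned substring pairs."""
    counts = Counter()
    for length in range(1, d + 1):
        for start in range(len(source) - length + 1):
            u = tuple(source[start:start + length])
            v = tuple(target[start:start + length])
            counts[(u, v)] += 1
    total = sum(counts.values())
    return {pair: Fraction(c, total) for pair, c in counts.items()}


def decode(x, grammar, d):
    """Return (score, token pairs) maximising the product in Equation (3).

    Returns None when no derivation covers the whole input.
    """
    by_source = defaultdict(list)
    for (u, v), p in grammar.items():
        by_source[u].append((v, p))
    # best[i] holds the best (score, derivation) covering the prefix x[:i].
    best = [None] * (len(x) + 1)
    best[0] = (Fraction(1), [])
    for end in range(1, len(x) + 1):
        for length in range(1, min(d, end) + 1):
            prev = best[end - length]
            if prev is None:
                continue
            u = tuple(x[end - length:end])
            for v, p in by_source.get(u, []):
                score = prev[0] * p
                if best[end] is None or score > best[end][0]:
                    best[end] = (score, prev[1] + [(u, v)])
    return best[len(x)]


train_src = "SOPHIE LEFT FOR PHILLY".split()
train_tgt = "Sophia went to Philadelphia".split()
grammar = epl_grammar(train_src, train_tgt, d=2)
score, derivation = decode("LEFT FOR PHILLY".split(), grammar, d=2)
translation = [word for _, v in derivation for word in v]
print(score, translation)
```

Because every factor is below one, derivations with fewer and longer tokens receive higher scores, so the two-word rule is preferred over two single-word rules in this toy setting.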



3. Theoretical Justification of Exhaustive Pattern Learning 

In this section, we will first introduce an ensemble model and a prior distribution of segmentation.
Then we will show the theorem of approximation, and present corollaries on conditional
probabilities and tree structures.



3.1 An Ensemble Model 

Let $D$ be the training data of $|D|$ words. Let $s$ be an arbitrary token segmentation on $D$, where
$s$ is unknown to us. Given $D$ and $s$, we can obtain a model/grammar $G_{D_s}$ with maximum likelihood
estimation. Thus, we can calculate the joint probability of $(u_j, v_j)$ given grammar $G_{D_s}$,
$\Pr(u_j, v_j \mid G_{D_s})$.

There are potentially an exponential number of distinct segmentations for $D$. Here, we use an
ensemble method to sum over all possible segmentations. This method would provide desirable
coverage and diversity of translation rules to be learned from the training data. For each
segmentation $s$, we have a fixed prior probability $\Pr(s)$, which we will define in Section 3.2.
Thus, we define the ensemble probability $L(u_j, v_j)$ as follows.

$$L(u_j, v_j) = \sum_{s} \Pr(u_j, v_j \mid G_{D_s}) \Pr(s). \tag{4}$$

Prior segmentation probabilities $\Pr(s)$ serve as model probabilities in (4). Having the model
probabilities fixed in this way could avoid over-fitting of the training data (DeNero et al., 2006).

In decoding, we search for the best hypothesis $\hat{v}$ given training data $D$ and input $x$ as follows.

$$(\hat{v}, \hat{a}) = \arg\max_{(v,a)} L(u_a, v), \text{ where } L(u_a, v) = \prod_{j=1}^{|u_a|} L(u_j, v_j).$$

What is interesting is that there turns out to be a prior distribution for $s$, such that, under certain
conditions, the limit of $L(u_a, v) / \Pr(u_a, v \mid G_{D,d})$ as $|D| \to \infty$ is a value that
depends only on $|x|$ and a parameter of the prior distribution $\Pr(s)$, to be shown in Theorem 3.
$|x|$ is a constant for all hypotheses for the same input. Therefore, $\Pr(u_a, v \mid G_{D,d})$ is a
constant-factor approximation of $L(u_a, v)$. Using $G_{D,d}$ is, to some extent, equivalent to using
all possible MLE grammars at the same time via an ensemble method.



3.2 Prior Distribution of Segmentation 

Now we define a probabilistic model to generate segmentations. $s = (s_1, s_2, \ldots, s_{|D|-1})$ is
a vector of $|D| - 1$ independent Bernoulli variables. $s_i$ represents whether $x_i$ and $x_{i+1}$
belong to separate tokens: 1 means yes and 0 means no. All the individual separating variables have
the same distribution, $P_q(s_i = 0) = q$ and $P_q(s_i = 1) = 1 - q$, for a given real value $q$,
$0 < q < 1$. Since $L(u_a, v)$ depends on $q$ now, we rewrite it as $L_q(u_a, v)$.

Based on the definition, Lemma 1 immediately follows, which will be used later.

Lemma 1 For each string pair $(u, v)$, the probability that an appearance of $(u, v)$ in $D$ is exactly
tokenized as $u$ and $v$ by $s$ is $q^{|u|-1}(1 - q)^2$.
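
For very small training data, the ensemble probability of Equation (4) under this Bernoulli prior can be computed by brute force, which also gives a direct check of Lemma 1. The following sketch enumerates all $2^{|D|-1}$ segmentations of the four-word example; the helper names `tokenize` and `ensemble_probability` are assumptions of this illustration, and the chosen pair sits in the interior of $D$ so that both of its boundaries are governed by separating variables.

```python
from fractions import Fraction
from itertools import product


def tokenize(source, target, s):
    """Split the aligned strings wherever the separating variable is 1."""
    pairs, start = [], 0
    for i, separated in enumerate(s):
        if separated:
            pairs.append((tuple(source[start:i + 1]), tuple(target[start:i + 1])))
            start = i + 1
    pairs.append((tuple(source[start:]), tuple(target[start:])))
    return pairs


def ensemble_probability(source, target, u, v, q):
    """L_q(u, v) = sum over s of Pr(u, v | G_{D_s}) * Pr(s), by enumeration."""
    total = Fraction(0)
    for s in product((0, 1), repeat=len(source) - 1):
        prior = Fraction(1)
        for bit in s:
            prior *= (1 - q) if bit else q  # P(s_i = 1) = 1 - q, P(s_i = 0) = q
        pairs = tokenize(source, target, s)
        total += prior * Fraction(pairs.count((u, v)), len(pairs))
    return total


source = "SOPHIE LEFT FOR PHILLY".split()
target = "Sophia went to Philadelphia".split()
q = Fraction(2, 3)  # q = d / (d + 1) with d = 2
u, v = ("LEFT", "FOR"), ("went", "to")
L_q = ensemble_probability(source, target, u, v, q)
exact_tokenization = q ** (len(u) - 1) * (1 - q) ** 2  # Lemma 1
print(L_q, exact_tokenization)
```

Only the segmentation $s = (1, 0, 1)$ produces the token pair, with prior $q(1-q)^2 = 2/27$ and $|D_s| = 3$, so the enumeration yields $L_q = 2/81$. Enumeration is exponential in $|D|$ and is meant only as a sanity check on toy data.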

3.3 Theorem of Approximation 

Let $x = x_1 x_2 \ldots x_i$ be an input source string. Let $a$ be a segmentation on $x$, and the
resultant token string be $u_a = u_1 u_2 \ldots u_k$. Let $v = v_1 v_2 \ldots v_k$ be a hypothesis
translation of $u_a$. Let $m_j = \#(u_j, v_j)$, the number of times that string pair $(u_j, v_j)$
appears in the training data $D$, $1 \le j \le k$. Let $m_{j,s} = \#_s(u_j, v_j)$, the number of times
that this token pair appears in the segmented data $D_s$. In order to prove Theorem 2, we assume that
the following two assumptions are true for any pair of tokens $(u_j, v_j)$.

Assumption 1 Any two of the $m_j$ appearances in $D$ are neither overlapping nor consecutive.

This assumption is necessary for the calculation of $E[m_{j,s}]$, $1 \le j \le k$. Based on Lemma 1,
the number of times that $(u_j, v_j)$ is exactly tokenized in this way with segmentation $s$ follows a
binomial distribution $B(m_j, q^{|u_j|-1}(1 - q)^2)$, so that

$$E[m_{j,s}] = m_j\, q^{|u_j|-1} (1 - q)^2,$$

where $|u_j|$ is the number of words in $u_j$. In addition, since there is no overlap, these
appearances cover a total of $|u_j|\, m_j$ source words.

Assumption 2 Let $\eta_j = (|u_j| + 1)\, m_j / |D|$. We have $\lim_{|D| \to \infty} \eta_j = 0$.

In fact, as we will see in Section 4.1, we do not have to rely on Assumption 2 to bound the
ratio of $\Pr(u_a, v \mid G_{D,d})$ and $L_q(u_a, v)$. We know that $\eta_j$ is a very small positive
number, and we can build the upper and lower bounds of the ratio based on $\eta_j$. However, with this
assumption, it will be much easier to see the big picture, so we assume that it is true in the rest of
this section.

Theorem 2 Suppose Assumptions 1 and 2 hold for a given pair of tokens $(u_j, v_j)$, then we have

$$\lim_{|D| \to \infty} \frac{L_q(u_j, v_j)}{\Pr(u_j, v_j \mid G_{D,d})} = q^{|u_j|},$$

where $q = d/(d + 1)$.

Later in the section, we will show Theorem 2 with Lemmas 4 and 5. Theorem 3 immediately
follows from Theorem 2.

Theorem 3 Suppose Assumptions 1 and 2 hold for any $j$, then we have

$$\lim_{|D| \to \infty} \frac{L_q(u_a, v)}{\Pr(u_a, v \mid G_{D,d})} = q^{|x|},$$

where $q = d/(d + 1)$.

Here, $|x|$ is a constant for hypotheses of the same input. An interesting observation is that the
prior segmentation model that fits this theorem tends to generate longer tokens if we have a larger
value for $d$.

Figure 3: An example of training data splitting.

We will show Theorem 2 by bounding the ratio from above and below via Lemmas 4 and 5 respectively.
Now, we introduce the notation to be used in the proofs of Lemmas 4 and 5 in Appendixes A and B
respectively.

First, we combine (1) and (4), and obtain

$$L_q(u_j, v_j) = E_s\!\left[\frac{m_{j,s}}{|D_s|}\right]. \tag{5}$$

With Assumption 1, we know the value of $E[m_{j,s}]$. However, $|D_s|$ depends on $m_{j,s}$, and this
prevents us from computing the expected value on each individual item.

We solve it by bounding $|D_s|$ with values independent of $m_{j,s}$, or the separating variables
related to the $m_j$ appearances in $D$. We divide $D$ into two parts, $H$ and $I$, based on the $m_j$
appearances of $(u_j, v_j)$ pairs. $H$ is the part that contains all $m_j$ appearances and nothing
else, and $I$ is the rest of $D$, so that the internal separating variables of $I$ are independent of
$m_{j,s}$. An example is shown in Figure 3. Black boxes represent the $m_j$ appearances.

We concatenate fragments in $I$ and keep the $I$-internal separating variables as in $s$. There are
two variants of the segmentation for $I$, depending on how we define the separating variables between
the fragments. So we have the following two segmented sub-sets.

• $I_{s_0(I)}$: inter-fragment separating variable = 0.

• $I_{s_1(I)}$: inter-fragment separating variable = 1.

Here, $s_0(I)$ and $s_1(I)$ represent the two segmentation vectors on $I$ respectively, each of which
has $|I| - m_j - 1$ changeable separating variables, where $|I|$ is the number of words contained in
$I$. The number of changeable variables that are set to 1 follows a binomial distribution
$B(|I| - m_j - 1, 1 - q)$. In Figure 3, fixed inter-fragment separating variables are represented in
the bold italic font.

If there are $s$ changeable variables set to 1 in $I_{s_0(I)}$, the number of tokens in $I_{s_0(I)}$
is $|I_{s_0(I)}| = s + 1$. Similarly, if there are $s$ changeable variables set to 1 in $I_{s_1(I)}$,
the number of tokens in $I_{s_1(I)}$ is $|I_{s_1(I)}| = s + m_j + 1$.

In addition, it is easy to verify that

$$|I_{s_0(I)}| \le |D_s| \tag{6}$$

$$|D_s| \le |I_{s_1(I)}| + |u_j|\, m_j \tag{7}$$

Combining (5), (6) and the two assumptions, we have the upper bound in Lemma 4.

Lemma 4 If Assumptions 1 and 2 hold,

$$\lim_{|D| \to \infty} \frac{L_q(u_j, v_j)}{\Pr(u_j, v_j \mid G_{D,d})} \le q^{|u_j|},$$

where $q = d/(d + 1)$.

Similarly, combining (5), (7) and the two assumptions, we obtain the lower bound in Lemma 5.

Lemma 5 If Assumptions 1 and 2 hold,

$$\lim_{|D| \to \infty} \frac{L_q(u_j, v_j)}{\Pr(u_j, v_j \mid G_{D,d})} \ge q^{|u_j|},$$

where $q = d/(d + 1)$.

The proofs of Lemmas 4 and 5 are given in Appendixes A and B respectively. The proof of
Lemma 4 also depends on Lemma 8. Lemma 8 and its proof are given in Appendix C.
Therefore, Theorem 2 holds.
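
To see how the two bounds squeeze the ratio, the sketch below evaluates finite-$|D|$ versions of the upper and lower bounds and compares them with $q^{|u_j|}$. It assumes the bounds take the form read from the derivations in Appendixes A and B, namely $N / \big((1-q)(1-\eta_j)\big)$ from above and $N / \big(1 - q + q\eta_j + q/|D|\big)$ from below, with $N = q^{|u_j|-1}(1-q)^2 \big(1 - \frac{d-1}{2|D|}\big) d$ and $\eta_j = (|u_j|+1)\,m_j/|D|$. With $q = d/(d+1)$ we have $(1-q)\,d = q$, which is why both expressions tend to $q^{|u_j|}$. The function name `ratio_bounds` and the sample values are hypothetical.

```python
from fractions import Fraction


def ratio_bounds(u_len, m, d, size):
    """Bounds on L_q / Pr(. | G_{D,d}) for |u_j| = u_len, m_j = m, |D| = size."""
    q = Fraction(d, d + 1)
    eta = Fraction((u_len + 1) * m, size)
    common = q ** (u_len - 1) * (1 - q) ** 2 * (1 - Fraction(d - 1, 2 * size)) * d
    upper = common / ((1 - q) * (1 - eta))
    lower = common / (1 - q + q * eta + q / size)
    return lower, upper, q ** u_len


for size in (10**3, 10**5, 10**7):
    lower, upper, limit = ratio_bounds(u_len=2, m=5, d=3, size=size)
    print(size, float(lower), float(limit), float(upper))
```

With $m_j$ held fixed, the gap between the two bounds shrinks roughly in proportion to $1/|D|$ in this example; a count $m_j$ that grows linearly with $|D|$ would violate Assumption 2 and leave a residual gap.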

3.4 Corollaries on Conditional Probabilities 

Theorem 2 is for the joint distribution of token pairs. In previous work using EPL, conditional
probabilities were often used, for example $P(u \mid v)$ and $P(v \mid u)$. Starting from Theorem 2, we
can easily obtain the following corollaries for conditional probabilities.

Corollary 6 Suppose Assumptions 1 and 2 hold for a given pair of tokens $(u_j, v_j)$, then we have

$$\lim_{|D| \to \infty} \Pr(u_j \mid v_j, G_{D,d}) \Big/ \frac{L_q(u_j, v_j)}{\sum_{u} L_q(u, v_j)} = 1,$$

where $q = d/(d + 1)$.

Proof According to the definition, $\Pr(u, v_j \mid G_{D,d}) = 0$ and $L_q(u, v_j) = 0$ if
$|u| \ne |v_j|$. Therefore, we only need to consider all pairs of $(u, v_j)$ such that
$|u| = |v_j| = |u_j|$. The number of distinct $u$ is a finite number, since the source vocabulary is a
finite set. Therefore, according to Theorem 2, for any small positive number $\epsilon$, there exists a
positive number $n$, such that, if $|D| > n$, we have

$$(1 - \epsilon) \frac{L_q(u, v_j)}{q^{|u_j|}} < \Pr(u, v_j \mid G_{D,d}) < (1 + \epsilon) \frac{L_q(u, v_j)}{q^{|u_j|}}. \tag{8}$$

Therefore, we have

$$\Pr(u_j \mid v_j, G_{D,d}) = \frac{\Pr(u_j, v_j \mid G_{D,d})}{\sum_{u} \Pr(u, v_j \mid G_{D,d})} < \frac{1 + \epsilon}{1 - \epsilon} \cdot \frac{L_q(u_j, v_j)}{\sum_{u} L_q(u, v_j)}.$$

Thus,

$$\Pr(u_j \mid v_j, G_{D,d}) \Big/ \frac{L_q(u_j, v_j)}{\sum_{u} L_q(u, v_j)} < \frac{1 + \epsilon}{1 - \epsilon}.$$

Similarly, we have

$$\Pr(u_j \mid v_j, G_{D,d}) \Big/ \frac{L_q(u_j, v_j)}{\sum_{u} L_q(u, v_j)} > \frac{1 - \epsilon}{1 + \epsilon}.$$

Therefore,

$$\lim_{|D| \to \infty} \Pr(u_j \mid v_j, G_{D,d}) \Big/ \frac{L_q(u_j, v_j)}{\sum_{u} L_q(u, v_j)} = 1.$$

Corollary 7 Suppose Assumptions 1 and 2 hold for a given pair of tokens $(u_j, v_j)$, then we have

$$\lim_{|D| \to \infty} \Pr(v_j \mid u_j, G_{D,d}) \Big/ \frac{L_q(u_j, v_j)}{\sum_{v} L_q(u_j, v)} = 1,$$

where $q = d/(d + 1)$.

The proof of Corollary 7 is similar to that of Corollary 6. Therefore, conditional probabilities in
EPL are reasonable approximations of the conditional ensemble probability functions.

The proofs for the conditional probabilities depend on a special property of monotonic
translation: the length of $u_j$ is the same as the length of $v_j$. However, this is not true in real
applications of phrase-based translation. The source and the target sides may have different
segmentations. We leave the modeling of real phrase-based translation for future work.
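
The conditional probabilities discussed in this subsection are obtained from the EPL counts by renormalizing over the conditioning side, so the global normalizer of Equation (2) cancels. A minimal sketch follows, with hypothetical helper names (`epl_counts`, `source_given_target`) and a made-up toy corpus in which the target word b is aligned to two different source words.

```python
from collections import Counter
from fractions import Fraction


def epl_counts(source, target, d):
    """Raw counts #(u, v) of aligned substring pairs of 1..d words."""
    counts = Counter()
    for length in range(1, d + 1):
        for start in range(len(source) - length + 1):
            u = tuple(source[start:start + length])
            v = tuple(target[start:start + length])
            counts[(u, v)] += 1
    return counts


def source_given_target(counts, u, v):
    """Pr(u | v, G_{D,d}) = #(u, v) / sum over u' of #(u', v)."""
    denom = sum(c for (_, v2), c in counts.items() if v2 == v)
    return Fraction(counts[(u, v)], denom)  # raises ZeroDivisionError if v is unseen


# Toy data (an assumption of this sketch): target word "b" translates both B and C.
source = "A B A C".split()
target = "a b a b".split()
counts = epl_counts(source, target, d=2)
p_word = source_given_target(counts, ("B",), ("b",))
p_phrase = source_given_target(counts, ("A", "B"), ("a", "b"))
print(p_word, p_phrase)
```

Because source and target tokens always have the same length in monotonic translation, the sum in the denominator only ever ranges over source strings of length $|v|$, which is the property the proof of Corollary 6 relies on.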



3.5 Extension to Tree Structures 

Now we try to extend Theorem 2 to the string-to-tree grammar. First, we define a prior distribution
on tree segmentation. We assign a Bernoulli variable to each tree node, representing the probability
that we separate the tree at this node, i.e., with a probability of $1 - q$, we choose to separate each
node.

Let $(u_j, v_j)$ be a string-tree pair, where $u_j$ is a source string and $v_j$ is a target tree. Let
$t_j$ be the number of words in $u_j$, and let $n_j$ be the number of non-terminals in $u_j$, where
$t_j + n_j \le d$, and $\sum_j t_j$ is the length of the input sentence, $|x|$. Thus, the probability
that an appearance of $(u_j, v_j)$ in $D$ is exactly tokenized in this way is
$q^{t_j - 1}(1 - q)^{n_j + 1}$.

With similar methods used in the proofs for string structures, we can show that, if Assumptions
1 and 2 hold,

$$\lim_{|D| \to \infty} \frac{L_q(u_j, v_j)}{\Pr(u_j, v_j \mid G_{D,d})} = c\, q^{t_j - 1} (1 - q)^{n_j},$$

where $c = \sum_{t_j + n_j \le d} \#(u_j, v_j) / |D|$ is a constant, and $q$ is a free parameter. We skip
the proof here to avoid duplication of a similar procedure. We define

$$P_{q,d}(u_j, v_j) = c\, q^{t_j - 1} (1 - q)^{n_j} \Pr(u_j, v_j \mid G_{D,d}).$$

Thus, $P_{q,d}(u_j, v_j)$ approximates $L_q(u_j, v_j)$, where

$$\lim_{|D| \to \infty} \frac{L_q(u_j, v_j)}{P_{q,d}(u_j, v_j)} = 1.$$

This result shows a theoretically better way of using heuristic grammar in string-to-tree models.

4. Discussion

In this section, we will focus on three facts that need more explanation. 

4.1 On the Use of Assumption 2

In the proofs of Lemmas 4 and 5, Assumption 2 is only used in the very last steps. Therefore, we
could build the upper and lower bounds of the ratio without Assumption 2 by connecting
Inequalities (12) and (13) in Appendixes A and B respectively.

4.2 On the Ensemble Probability 

The ensemble probability in (4) can be viewed as a simplification of a Bayesian model in (9).

$$L(u_j, v_j) = \Pr(u_j, v_j \mid D) = \sum_{G \in \mathcal{G}(D)} \Pr(u_j, v_j \mid G) \Pr(G \mid D) \tag{9}$$

In (9), we marginalize over all possible token-based grammars $G$ from $D$, $\mathcal{G}(D)$.
Furthermore,

$$\Pr(G \mid D) = \sum_{s} \Pr(G \mid D, s) \Pr(s \mid D).$$

Then, we approximate the posterior probability of $G$ given $D$ and $s$ with point estimation. Thus,
$\Pr(G \mid D, s) = 1$ if and only if $G$ is the MLE grammar of $D_s$, which means all the distribution
mass is assigned to $G_{D_s}$, the MLE grammar for $D_s$. We also assume that $s$ is independent of $D$.
Thus,

$$\Pr(G \mid D) = \sum_{s} \mathbb{1}(G = G_{D_s}) \Pr(s), \tag{10}$$

where $\Pr(s)$ is a prior distribution of segmentation for any string of $|D|$ words. With (10), we can
rewrite (9) as follows.

$$
\begin{aligned}
L(u_j, v_j) &= \sum_{G \in \mathcal{G}(D)} \Pr(u_j, v_j \mid G) \sum_{s} \mathbb{1}(G = G_{D_s}) \Pr(s) \\
&= \sum_{s} \sum_{G \in \mathcal{G}(D)} \Pr(u_j, v_j \mid G)\, \mathbb{1}(G = G_{D_s}) \Pr(s) \\
&= \sum_{s} \Pr(u_j, v_j \mid G_{D_s}) \Pr(s)
\end{aligned}
\tag{11}
$$

Equation (11) is exactly the ensemble probability in Equation (4).



4.3 On the DOP Model 



The EPL method investigated in this article may date back to Data Oriented Parsing (DOP) by
Bod (1992). What is special with DOP is that the DOP model uses overlapping treelets of various sizes
in an exhaustive way as building blocks of a statistical tree grammar.

In our framework, for each pair $(u_j, v_j)$, we can use $u_j$ to represent the input text, and $v_j$
to represent its tree structure. Thus, it would be similar to the string-to-tree model in Section 3.5.
The joint probability of $(u_j, v_j)$ stands for the unigram probability Pr(treelet).

However, the original DOP estimator (DOP1) is quite different from our monotonic translation
model. The conditional probability in DOP1 is defined as Pr(treelet | subroot-label), so that there is
no obvious way to model DOP1 with monotonic translation. Therefore, theoretical justification of
DOP1 is still an open problem.



5. Conclusion 

In this article, we first formalized exhaustive pattern learning (EPL), which is widely used in
grammar induction in NLP. We showed that using an EPL heuristic grammar is, up to a constant
factor, equivalent to using an ensemble method to cope with the uncertainty of building blocks of
statistical models.

Better understanding of EPL may lead to improved pattern learning algorithms in the future. This
work will affect research in various fields of natural language processing, including machine
translation, parsing, and sequence classification. EPL can also be applied to other research fields
outside NLP.

Appendix A. Proof for Lemma 4



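$$
\begin{aligned}
L_q(u_j, v_j) &= E_s\!\left[\frac{m_{j,s}}{|D_s|}\right] && \{\text{Eqn. (5)}\} \\
&\le E_s\!\left[\frac{m_{j,s}}{|I_{s_0(I)}|}\right] && \{\text{Eqn. (6)}\} \\
&= E[m_{j,s}]\; E\!\left[\frac{1}{|I_{s_0(I)}|}\right] && \{\text{Independence of } m_{j,s} \text{ and } I_{s_0(I)}\} \\
&< \frac{E[m_{j,s}]}{(1 - q)(|I| - m_j)} && \{\text{Lemma 8}\} \\
&= \frac{m_j\, q^{|u_j|-1} (1 - q)^2}{(1 - q)(|I| - m_j)} && \{\text{Binomial Dist., Assumption 1}\} \\
&= \frac{m_j\, q^{|u_j|-1} (1 - q)^2}{(1 - q)(|D| - |u_j| m_j - m_j)} \\
&= \frac{m_j\, q^{|u_j|-1} (1 - q)^2}{(1 - q)(1 - \eta_j)|D|} \\
&= \frac{m_j}{\left(1 - \frac{d-1}{2|D|}\right) d\,|D|} \cdot \frac{q^{|u_j|-1} (1 - q)^2 \left(1 - \frac{d-1}{2|D|}\right) d}{(1 - q)(1 - \eta_j)} \\
&= \Pr(u_j, v_j \mid G_{D,d})\; q^{|u_j|}\, \frac{(1 - q)\left(1 - \frac{d-1}{2|D|}\right) d}{(1 - \eta_j)\, q}
\end{aligned}
$$

Thus,

$$\frac{L_q(u_j, v_j)}{\Pr(u_j, v_j \mid G_{D,d})} < q^{|u_j|}\, \frac{(1 - q)\left(1 - \frac{d-1}{2|D|}\right) d}{(1 - \eta_j)\, q}. \tag{12}$$

Since $(1 - q)\, d = q$ for $q = d/(d + 1)$, we have

$$\lim_{|D| \to \infty} \frac{L_q(u_j, v_j)}{\Pr(u_j, v_j \mid G_{D,d})} \le q^{|u_j|} \qquad \{\text{Assumption 2}\}$$

Appendix B. Proof for Lemma 5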



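$$
\begin{aligned}
L_q(u_j, v_j) &= E_s\!\left[\frac{m_{j,s}}{|D_s|}\right] && \{\text{Eqn. (5)}\} \\
&\ge E_s\!\left[\frac{m_{j,s}}{|I_{s_1(I)}| + |u_j| m_j}\right] && \{\text{Eqn. (7)}\} \\
&= E[m_{j,s}]\; E\!\left[\frac{1}{|I_{s_1(I)}| + |u_j| m_j}\right] && \{\text{Independence of } m_{j,s} \text{ and } I_{s_1(I)}\} \\
&\ge \frac{E[m_{j,s}]}{E_{s_1(I)}\!\left[|I_{s_1(I)}|\right] + |u_j| m_j} && \{\text{Jensen's inequality}\} \\
&= \frac{m_j\, q^{|u_j|-1} (1 - q)^2}{(1 - q)(|I| - 1 - m_j) + m_j + 1 + |u_j| m_j} && \{\text{Binomial Dist., Assumption 1}\} \\
&= \frac{m_j\, q^{|u_j|-1} (1 - q)^2}{(1 - q)(|D| - |u_j| m_j) + q(1 + m_j) + |u_j| m_j} \\
&= \frac{m_j\, q^{|u_j|-1} (1 - q)^2}{(1 - q)|D| + q(|u_j| m_j + m_j) + q} \\
&= \frac{m_j\, q^{|u_j|-1} (1 - q)^2}{\left(1 - q + q\eta_j + \frac{q}{|D|}\right)|D|} \\
&= \frac{m_j}{\left(1 - \frac{d-1}{2|D|}\right) d\,|D|} \cdot \frac{q^{|u_j|-1} (1 - q)^2 \left(1 - \frac{d-1}{2|D|}\right) d}{1 - q + q\eta_j + \frac{q}{|D|}} \\
&= \Pr(u_j, v_j \mid G_{D,d})\; q^{|u_j|}\, \frac{(1 - q)^2 \left(1 - \frac{d-1}{2|D|}\right) d}{\left(1 - q + q\eta_j + \frac{q}{|D|}\right) q}
\end{aligned}
$$

Thus,

$$\frac{L_q(u_j, v_j)}{\Pr(u_j, v_j \mid G_{D,d})} \ge q^{|u_j|}\, \frac{(1 - q)^2 \left(1 - \frac{d-1}{2|D|}\right) d}{\left(1 - q + q\eta_j + \frac{q}{|D|}\right) q}. \tag{13}$$

Since $(1 - q)\, d = q$ for $q = d/(d + 1)$, we have

$$\lim_{|D| \to \infty} \frac{L_q(u_j, v_j)}{\Pr(u_j, v_j \mid G_{D,d})} \ge q^{|u_j|} \qquad \{\text{Assumption 2}\}$$

Appendix C. Lemma 8 and its Proof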

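Lemma 8 Let $X$ be a random variable of Binomial distribution $B(n, 1 - q)$, then

$$E\!\left[\frac{1}{X + 1}\right] < \frac{1}{(1 - q)(n + 1)}.$$

Proof

$$
\begin{aligned}
E\!\left[\frac{1}{X + 1}\right] &= \sum_{k=0}^{n} \frac{1}{k + 1} \cdot \frac{n!}{k!\,(n - k)!} (1 - q)^k q^{n-k} \\
&= \frac{1}{(1 - q)(n + 1)} \sum_{k=0}^{n} \frac{(n + 1)!}{(k + 1)!\,(n - k)!} (1 - q)^{k+1} q^{n-k} \\
&= \frac{1 - q^{n+1}}{(1 - q)(n + 1)} \\
&< \frac{1}{(1 - q)(n + 1)}
\end{aligned}
$$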
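
Lemma 8 is easy to check numerically. The sketch below evaluates $E[1/(X+1)]$ directly from the binomial probability mass function with exact rational arithmetic and compares it with the closed form $(1 - q^{n+1}) / ((1-q)(n+1))$ that appears in the proof; the function name `expected_inverse` and the sample values of $n$ and $q$ are arbitrary choices made for illustration.

```python
from fractions import Fraction
from math import comb


def expected_inverse(n, q):
    """E[1 / (X + 1)] for X ~ Binomial(n, 1 - q), summed term by term."""
    p = 1 - q
    return sum(
        Fraction(1, k + 1) * comb(n, k) * p ** k * q ** (n - k)
        for k in range(n + 1)
    )


n, q = 10, Fraction(2, 3)
direct = expected_inverse(n, q)
closed_form = (1 - q ** (n + 1)) / ((1 - q) * (n + 1))
bound = 1 / ((1 - q) * (n + 1))
print(direct == closed_form, float(direct), float(bound))
```

The gap between the expectation and the bound is $q^{n+1} / ((1-q)(n+1))$, which vanishes geometrically in $n$, so the bound used in Appendix A is tight for long training data.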